CP-NAS: Child-Parent Neural Architecture Search for 1-bit CNNs
Algorithm 9 Child-Parent NAS
Input: Training data, validation data
Parameter: Search hyper-graph $G$; $K = 8$; selection$(o_k^{(i,j)}) = 0$ for all edges
Output: Optimal structure $\alpha$
1: while $K > 1$ do
2:   for $t = 1, \ldots, T$ do
3:     for $e = 1, \ldots, K$ do
4:       Select an architecture by sampling (without replacement) one operation from $O^{(i,j)}$ for every edge;
5:       Construct the Child model and the Parent model with the same selected architecture, train both models for one epoch, and obtain the accuracy on the validation data; use Eq. 4.15 to compute the performance and assign it to all the sampled operations;
6:     end for
7:   end for
8:   Update $e(o_k^{(i,j)})$ using Eq. 4.16;
9:   Reduce the search space $\{O^{(i,j)}\}$ by removing the operation with the worst performance evaluation $e(o_k^{(i,j)})$;
10:  $K = K - 1$;
11: end while
12: return the optimal structure $\alpha$
4.3.3 Search Strategy for CP-NAS
As shown in Fig. 4.4, we randomly sample one operation from the $K$ operations in $O^{(i,j)}$ for every edge and then obtain the performance based on Eq. 4.15 by training the sampled Parent and Child networks for one epoch. Finally, we assign this performance to all the sampled operations. These steps are performed $K$ times by sampling without replacement, so that, for fairness, each operation on every edge receives exactly one accuracy.
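To make this step concrete, the following is a minimal Python sketch of one sampling round. The names `fair_sampling_round`, `search_space`, and `train_and_evaluate` are hypothetical: `search_space` stands in for the edge-wise candidate sets $O^{(i,j)}$, and `train_and_evaluate` for the one-epoch Child-Parent training and the performance computation of Eq. 4.15.

```python
import random

def fair_sampling_round(search_space, train_and_evaluate):
    """One round of fair sampling: K architectures drawn without
    replacement, so every operation on every edge is scored exactly once.
    `search_space` maps an edge (i, j) to its list of K candidate operations
    (all edges are assumed to hold the same number K of candidates)."""
    K = len(next(iter(search_space.values())))
    # Shuffling each edge's candidates and reading them off column by
    # column realizes sampling without replacement across the K rounds.
    shuffled = {edge: random.sample(ops, K) for edge, ops in search_space.items()}
    scores = {edge: {} for edge in search_space}
    for e in range(K):
        arch = {edge: ops[e] for edge, ops in shuffled.items()}
        z = train_and_evaluate(arch)   # Eq. 4.15, one epoch (assumed helper)
        for edge, op in arch.items():
            scores[edge][op] = z       # assign z to every sampled operation
    return scores
```

Repeating such a round $T$ times yields the $T$ performance values per operation used below.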
We repeat the complete sampling process $T$ times. Thus, each operation on every edge has $T$ performance values $\{z_{k,1}^{(i,j)}, z_{k,2}^{(i,j)}, \ldots, z_{k,T}^{(i,j)}\}$ calculated by Eq. 4.15. Furthermore, to reduce undesired fluctuation in the performance evaluation, we normalize the performance of the $K$ operations on each edge to obtain the final evaluation indicator as
\[
e(o_k^{(i,j)}) = \frac{\exp\{\bar{z}_k^{(i,j)}\}}{\sum_{k'} \exp\{\bar{z}_{k'}^{(i,j)}\}},
\tag{4.16}
\]
where $\bar{z}_k^{(i,j)} = \frac{1}{T} \sum_t z_{k,t}^{(i,j)}$. As the epochs increase, we progressively abandon the operation with the worst evaluation on each edge until only one operation remains per edge.
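As a sketch of Eq. 4.16 and the pruning step, the following NumPy snippet (with hypothetical names `evaluation_indicator` and `prune_worst`) averages the $T$ scores per operation, softmax-normalizes them across the $K$ operations of an edge, and drops the worst one.

```python
import numpy as np

def evaluation_indicator(z_history):
    """Eq. 4.16 for one edge: z_history has shape (K, T), one row of T
    performance values per operation."""
    z_bar = z_history.mean(axis=1)      # \bar{z}_k^{(i,j)}, mean over T
    z_bar = z_bar - z_bar.max()         # shift for numerical stability
    return np.exp(z_bar) / np.exp(z_bar).sum()

def prune_worst(ops, z_history):
    """Abandon the operation with the worst evaluation, so K shrinks by 1."""
    worst = int(np.argmin(evaluation_indicator(z_history)))
    return [op for i, op in enumerate(ops) if i != worst]
```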
4.3.4 Optimization of the 1-Bit CNNs
Inspired by XNOR and PCNN, we reformulate the binarized optimization of our unified framework as a Child-Parent optimization.

To binarize the weights and activations of CNNs, we introduce a kernel-level Child-Parent loss for binarized optimization in two respects. First, we minimize the discrepancy between the full-precision filters and the corresponding binarized filters. Second, we minimize an intra-class compactness term based on the output features. We then have the loss function
\[
L_{\hat{H}} = \sum_{c,l} \mathrm{MSE}\bigl(H_c^l, \hat{H}_c^l\bigr) + \frac{\lambda}{2} \sum_s \bigl\lVert f_{C,s}(\hat{H}) - \bar{f}_{C,s}(H) \bigr\rVert^2,
\tag{4.17}
\]
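A hedged PyTorch sketch of Eq. 4.17 follows, assuming the full-precision filters $H_c^l$ and binarized filters $\hat{H}_c^l$ are given as paired tensors and that the reference features $\bar{f}_{C,s}(H)$ have been precomputed; the function and argument names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def child_parent_loss(fp_filters, bin_filters, child_feats, ref_feats, lam):
    """Sketch of Eq. 4.17: kernel-level reconstruction error plus the
    feature-compactness term, weighted by lambda / 2.

    fp_filters, bin_filters: paired lists of tensors H_c^l and \hat{H}_c^l.
    child_feats: output features f_{C,s}(\hat{H}) of the binarized Child.
    ref_feats:   the reference features \bar{f}_{C,s}(H), same shape.
    """
    # First term: MSE between each full-precision filter and its binarized
    # counterpart, summed over layers and channels.
    recon = sum(F.mse_loss(h_hat, h) for h, h_hat in zip(fp_filters, bin_filters))
    # Second term: squared distance between the Child's output features and
    # the reference features, summed over samples s.
    compact = 0.5 * lam * ((child_feats - ref_feats) ** 2).sum()
    return recon + compact
```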